The major parts of the following analysis were inspired by the pipeline used in the article Rossi et al. Cell Stem Cell 2020. The associated code is available on GitHub, in the Cardiac_Gastruloids repository.
The goal of the analysis is to compare the data obtained in the lab with the data from the Rossi et al. article. The data used in the Rossi et al. article will be called "reference data" in the rest of the document. These data were retrieved from the GEO website under accession number GEO: GSE158999. The accessed files are:
barcodes.tsv.gz,
features.tsv.gz,
matrix.mtx.gz,
metadata.tsv.gz.
The Rossi et al. article also refers to and uses the in vivo atlas from Pijuan-Sala Griffiths Gottgens et al. Nature 2019: in Rossi et al., a classifier was trained on the atlas data. I directly downloaded the classifier from the above-mentioned GitHub repository.
I also downloaded the atlas data of Pijuan-Sala et al. from a terminal with the following commands
curl https://content.cruk.cam.ac.uk/jmlab/atlas_data.tar.gz > atlas_data.tar.gz
tar -zxvf atlas_data.tar.gz
To reproduce the analysis, it is recommended to set up all these directories. I assume that all the input files are stored in the same directory, inputData. This directory is divided into 5 subdirectories:
2 of them are dedicated to the raw data to be analyzed, from the Marseille Medical Genetics (MMG) laboratory and from the reference dataset (zaffranRawData and rossiRawData, respectively),
1 directory contains the data needed to create an atlas object from Pijuan-Sala et al. (atlas),
1 directory stores the classifier provided by Rossi et al., based on the atlas of Pijuan-Sala et al. (scmap_pijuansala_classifier),
1 directory contains input tables provided by Rossi et al. (InputTables).
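Under these assumptions, the input directory layout looks like this (directory names taken from the list above):

```text
inputData/
├── zaffranRawData/              # raw data from the MMG laboratory
├── rossiRawData/                # raw reference data from Rossi et al.
├── atlas/                       # Pijuan-Sala et al. atlas data
├── scmap_pijuansala_classifier/ # classifier provided by Rossi et al.
└── InputTables/                 # input tables provided by Rossi et al.
```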
basePath <- "XXXXX" # to be seen as the root for any analysis based on a set of input data
# input paths
atlas.directory <- paste0(basePath, "inputData/atlas/")
classifier.folder <- paste0(basePath, "inputData/scmap_pijuansala_classifier/")
inputTables.folder <- paste0(basePath, "inputData/InputTables/")
# output paths
baseAnalysis <- paste0(basePath, "XXXXX")
ensure_dir <- function(path) {
if (!dir.exists(path)) {
dir.create(path, recursive = TRUE)
}
}
ensure_dir(baseAnalysis)
rdsObjects <- paste0(baseAnalysis, "rdsObjects/")
ensure_dir(rdsObjects)
atlas.folder <- paste0(baseAnalysis, "atlas/")
ensure_dir(atlas.folder)
fig.folder <- paste0(baseAnalysis, "figures/")
ensure_dir(fig.folder)
To keep the colors consistent across all the plots of the analysis, I set up named vectors of colors.
colors.table <- read.table(file = paste0(inputTables.folder, "ClusterColors.tsv"), sep = "\t", header = T,
comment.char = "", as.is = T)
colors.use_ab.initio <- setNames(colors.table$blind_friendly[!is.na(colors.table$ab_initio_identity)], colors.table$ab_initio_identity[!is.na(colors.table$ab_initio_identity)])
colors.use_transferred <- setNames(colors.table$blind_friendly[!is.na(colors.table$transferred_identity)],
colors.table$transferred_identity[!is.na(colors.table$transferred_identity)])
colors.use_stages <- setNames(c("#be465c", "#b04b3d", "#c86633", "#d0a63f", "#998138", "#90b03d", "#60a756",
"#45c097", "#5e8bd5", "#6d71d8", "#573585", "#bd80d5", "#b853a2", "#ba4b7d"), c("Day_04", "Day_05", "Day_06",
"Day_07", "Day_10", "E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5"))
Some of the following functions support parallelization. Here I configure how it should be carried out.
# Parallelization
plan(strategy = "multisession", workers = 4)
# plan(batchtools_slurm, template =
# '/shared/projects/mothard_in_silico_modeling/slurm_scripts/aa_seuratAn.tmpl')
options(future.globals.maxSize = 62914560000) #60GB
options(future.seed = 26)
The Read10X function only needs the directory in which the input files are stored. It builds an expression matrix by interpreting the barcodes, features and matrix files.
The CreateSeuratObject function then initializes a Seurat object from that matrix.
ref.dataset.folder <- paste0(basePath, "inputData/ref_data/")
ref.raw.data <- Read10X(data.dir = ref.dataset.folder, gene = 1)
ref.4d.SO <- CreateSeuratObject(counts = ref.raw.data, project = "recoverAnalysis") #, assay='RNA', min.cells=3, min.features=100)
# Dimensions of the object
head(ref.4d.SO)
dim(ref.4d.SO)
## An object of class Seurat
## 1 features across 30496 samples within 1 assay
## Active assay: RNA (1 features)
## [1] 23961 30496
The reference dataset contains 30496 cells in which 23961 features (genes) were detected.
The metadata file was created during the Rossi et al. analysis and is provided to guide users. I add its stage and Replicate columns to the Seurat object under the column names day and replicate, respectively.
ref.md <- read.table(file = paste0(ref.dataset.folder, "metadata.tsv.gz"), sep = "\t", header = TRUE, stringsAsFactors = F)
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$stage, col.name = "day")
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$Replicate, col.name = "replicate")
Here is all the metadata information contained in the Seurat object of the reference dataset: str(ref.4d.SO@meta.data)
## 'data.frame': 30496 obs. of 5 variables:
## $ orig.ident : Factor w/ 1 level "GAS": 1 1 1 1 1 1 1 1 1 1 ...
## $ nCount_RNA : num 53183 41938 43631 69730 64691 ...
## $ nFeature_RNA: int 6676 5225 5869 7090 7039 5568 6208 5684 3450 6052 ...
## $ day : chr "Day4" "Day4" "Day4" "Day4" ...
## $ replicate : int 0 0 0 0 0 0 0 0 0 0 ...
There are 5 metadata columns stored in a data frame; each of them describes the cells. Besides the metadata imported from the external file (day and replicate), there are the nCount_RNA and nFeature_RNA columns:
nCount_RNA is the number of reads counted in every droplet with a gel bead associated to a cell,
nFeature_RNA is the number of different features (genes) detected in every droplet.
We can list the different values a metadata column can take with the following command. Here is an example for the column named day: unique(ref.4d.SO@meta.data$day)
## [1] "Day4" "Day5" "Day6" "Day7"
For the sake of the analysis, I need to manipulate the labels to make them consistent across all datasets.
# modify day label
ref.4d.SO@meta.data$day <- gsub("^(Day)([0-9]*)$", "\\1_0\\2", ref.4d.SO@meta.data$day)
# create new metadata named dataset
ref.4d.SO$dataset <- paste0("ref_", tolower(ref.4d.SO@meta.data$day))
Idents(ref.4d.SO) <- ref.4d.SO$dataset
ref_dataset_names <- unique(ref.4d.SO$dataset)
A new metadata column named dataset has been created. The values it can take are among ref_day_04, ref_day_05, ref_day_06 and ref_day_07, depending on the origin and the stage (day) of the sequenced cells.
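The day relabeling above can be checked in plain base R (a minimal sketch using the labels listed earlier, without a Seurat object):

```r
# The regex captures the "Day" prefix and the digits, then rebuilds the
# label with a literal zero inserted: "Day4" -> "Day_04".
labels <- c("Day4", "Day5", "Day6", "Day7")
new_labels <- gsub("^(Day)([0-9]*)$", "\\1_0\\2", labels)
new_labels
# "Day_04" "Day_05" "Day_06" "Day_07"
```

Note that, because the replacement inserts a literal 0, this pattern only zero-pads single-digit days correctly; that is sufficient here since the reference data stop at day 7.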
Now here’s what the metadata looks like:
| | orig.ident | nCount_RNA | nFeature_RNA | day | replicate | dataset |
|---|---|---|---|---|---|---|
| GAS_Day4_batch4_AAACCCATCGACTCCT | GAS | 53183 | 6676 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAACGAAGTCATAACC | GAS | 41938 | 5225 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAACGCTCAAGCGAGT | GAS | 43631 | 5869 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAACGCTCACCCTAAA | GAS | 69730 | 7090 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAAGAACAGCCTTGAT | GAS | 64691 | 7039 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAAGAACAGTTTCGAC | GAS | 39653 | 5568 | Day_04 | 0 | ref_day_04 |
For the following parts of the analysis, I will split the dataset ref.4d.SO into 4 sub-datasets. I use the SplitObject function from the Seurat package (Stuart et al., Cell 2019). Given an attribute (a metadata column name), the object is split into a list of subsetted objects.
I split the dataset according to the day on which the gastruloids were sequenced.
# Split the object
refSO.list <- SplitObject(ref.4d.SO, split.by = "day")
# get dataset names
names(refSO.list) <- ref_dataset_names
# set up the project name of each sub-dataset
for (SO in refSO.list) {
projName <- unique(SO@meta.data$dataset)
refSO.list[[projName]]@project.name <- projName
}
Size of the entire dataset: dim(ref.4d.SO)
| Dataset | Nbr of features | Nbr of cells |
|---|---|---|
| ref_4_days | 23961 | 30496 |
Size of sub-datasets:
size_subref <- data.frame(cbind(sapply(refSO.list, function(SO) SO@project.name), t(sapply(refSO.list, dim))))
colnames(size_subref) <- c("Dataset", "Nbr of features", "Nbr of cells")
knitr::kable(size_subref, align = "lrr")
| | Dataset | Nbr of features | Nbr of cells |
|---|---|---|---|
| ref_day_04 | ref_day_04 | 23961 | 6898 |
| ref_day_05 | ref_day_05 | 23961 | 7783 |
| ref_day_06 | ref_day_06 | 23961 | 8313 |
| ref_day_07 | ref_day_07 | 23961 | 7502 |
The lab obtained the scRNA-seq data day by day. The sequencing saturation was not good enough for day 5, so that day was resequenced. As a result, I have five datasets: one dataset for each of days 4, 6 and 11, and two datasets for day 5.
For each dataset, I will execute the following steps:
load and interpret the raw data (barcodes, features and matrix files) with the Read10X function, and create a Seurat object (CreateSeuratObject function),
add metadata and,
rename the cells.
There are five sub-datasets to load. I create a Seurat object for each of them and store it in a list. The project name of each sub-dataset is given by the name of its directory.
# get path to the folders of each sub-dataset
dirs <- list.dirs(paste0(basePath, "inputData/lab_data"))
lab_dirs <- dirs[2:length(dirs)]
# create a list of seurat objects
labSO.list <- list()
lab_dataset_names <- c()
for (x in lab_dirs) {
dataset.name <- str_extract(x, "[a-z0-9_]*$")
lab_dataset_names <- append(lab_dataset_names, dataset.name)
raw.data <- Read10X(data.dir = x, gene = 2)
SO <- CreateSeuratObject(counts = raw.data, project = dataset.name) #, assay='RNA', min.cells=3, min.features=100)
labSO.list[[dataset.name]] <- SO
}
labSO.list
## $lab_day_04
## An object of class Seurat
## 32286 features across 7324 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_05
## An object of class Seurat
## 32286 features across 7794 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_05bis
## An object of class Seurat
## 32286 features across 7729 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_06
## An object of class Seurat
## 32286 features across 6628 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_11
## An object of class Seurat
## 32287 features across 9584 samples within 1 assay
## Active assay: RNA (32287 features)
As for the reference sub-datasets, the project names of the lab sub-datasets take their values among: lab_day_04, lab_day_05, lab_day_05bis, lab_day_06, lab_day_11
I need to add metadata, as for the reference dataset. I will add the day on which the gastruloids were sequenced, the dataset name - built the same way as the reference dataset names - and the replicate.
The replicate metadata is required because two datasets were sequenced on day 5. It will be useful to differentiate the cells of the two datasets.
labSO.list <- lapply(labSO.list, function(SO) {
dataset <- SO@project.name
# do not anchor the pattern at the end of the string: "lab_day_05bis" would not match "day_[0-9]*$"
day <- str_to_title(str_extract(dataset, "day_[0-9]+"))
SO <- AddMetaData(object = SO, metadata = day, col.name = "day")
SO <- AddMetaData(object = SO, metadata = dataset, col.name = "dataset")
if (dataset == "lab_day_05bis") {
SO <- AddMetaData(SO, metadata = 1, col.name = "replicate")
} else {
SO <- AddMetaData(SO, metadata = 0, col.name = "replicate")
}
return(SO)
})
knitr::kable(data.frame(head(labSO.list[[1]]@meta.data)), align = "lrrrrrr")
The cell names follow different nomenclatures in the reference and in the MMG laboratory sub-datasets. On the one hand, the reference cell names are already custom names combining the biological origin (GAS for gastruloids), the day the gastruloids were sequenced, the batch number and the UMI barcode. On the other hand, the laboratory cell names consist only of the UMI barcodes.
The barcodes used by 10x Genomics are drawn from a finite whitelist, so the number of unique barcodes is limited. Hence, working with data obtained from multiple sequencing experiments may lead to an overlap of barcodes between datasets.
Hence, I decide to create a nomenclature to rename all the cells under a unique identifier. Thus, each cell will be identified by
ref/lab whether the data come from Rossi et al. or the MMG laboratory,
the dataset value previously added to the metadata,
the replicate value, also present in the metadata and
the UMI provided by 10x Genomics.
rename_cells <- function(SO) {
cell_ids <- colnames(SO)
UMIs <- str_extract(cell_ids, "[A-Z]*$")
cellnames <- paste(SO$dataset, SO$replicate, UMIs, sep = "_")
SO <- RenameCells(SO, new.names = cellnames)
return(SO)
}
refSO.list <- lapply(refSO.list, rename_cells)
labSO.list <- lapply(labSO.list, rename_cells)
print(colnames(refSO.list$ref_day_04)[1])
print(colnames(labSO.list$lab_day_04)[1])
## [1] "ref_day_04_0_AAACCCATCGACTCCT"
## [1] "lab_day_04_0_AAACCCAAGTACAGAT"
We can check that there are as many unique cell identifiers as there are cells among all the sub-datasets.
nbCells <- Reduce("+", lapply(refSO.list, function(SO) dim(SO)[2])) + Reduce("+", lapply(labSO.list, function(SO) dim(SO)[2]))
nbIDs <- Reduce("+", lapply(refSO.list, function(SO) length(unique(colnames(SO))))) + Reduce("+", lapply(labSO.list,
function(SO) length(unique(colnames(SO)))))
print(paste("Nbr of cells among all the sub-datasets:", nbCells))
print(paste("Nbr of cell identifiers:", nbIDs))
## [1] "Nbr of cells among all the sub-datasets: 69555"
## [1] "Nbr of cell identifiers: 69555"
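The barcode extraction inside rename_cells can be illustrated in base R alone (a sketch: here sub(".*_", "", x) is equivalent to str_extract(x, "[A-Z]*$") because the barcode is the last underscore-separated field of the name):

```r
# Reference cell name: origin, day, batch and barcode separated by "_".
cell_id <- "GAS_Day4_batch4_AAACCCATCGACTCCT"
barcode <- sub(".*_", "", cell_id)  # keep everything after the last "_"
new_id <- paste("ref_day_04", 0, barcode, sep = "_")
new_id
# "ref_day_04_0_AAACCCATCGACTCCT"
```

This reproduces the first renamed reference cell shown above.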
Now, all the sub-datasets are prepared for the analysis. First, I will proceed to the quality control of the cells, after which I will run the standard preprocessing workflow of Seurat. Then, I will be able to remove the doublets using the DoubletFinder library (McGinnis et al., Cell Systems 2019). Finally, I will investigate whether there are sources of unwanted variation among the replicates.
First and foremost, I gather all the data into a common object: the sub-datasets are stored in a single list. The computational advantage lies in the use of lapply(), which applies a function to all members of a list in one command.
All the sub-datasets (reference and lab data) will be analyzed with the same pipeline.
allSO.list <- do.call(c, list(refSO.list, labSO.list))
# allSO.list <- append(refSO.list, labSO.list) # Equivalent allSO.list <- c(refSO.list, labSO.list) #
# Equivalent
SO.names <- names(allSO.list)
rm(refSO.list, labSO.list)
gc(verbose = FALSE)
str(allSO.list, max.level = 2)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6377269 340.6 11440542 611.0 8804052 470.2
## Vcells 990855682 7559.7 1475139588 11254.5 1339408159 10218.9
## List of 9
## $ ref_day_04 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ ref_day_05 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ ref_day_06 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ ref_day_07 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_04 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_05 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_05bis:Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_06 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_11 :Formal class 'Seurat' [package "Seurat"] with 12 slots
Now, all the sub-dataset Seurat objects are stored in a list. The names of the list are the project.name of each sub-dataset.
As the number of cells will decrease during the following steps, I will print a table with the number of cells, and also the number of detected features, in each sub-dataset. A barplot will also be displayed to follow the evolution of each sub-dataset along the analysis.
Size of all the sub-datasets:
size_suball <- data.frame(cbind(sapply(allSO.list, function(SO) {
if (str_detect(SO@project.name, "^ref")) {
"ref"
} else {
"lab"
}
}), sapply(allSO.list, function(SO) {
if (str_detect(SO@project.name, "04")) {
"4"
} else if (str_detect(SO@project.name, "05bis")) {
"5bis"
} else if (str_detect(SO@project.name, "05")) {
"5"
} else if (str_detect(SO@project.name, "06")) {
"6"
} else if (str_detect(SO@project.name, "07")) {
"7"
} else if (str_detect(SO@project.name, "11")) {
"11"
}
}), t(sapply(allSO.list, dim)), "Preliminary Counts", "01"))
colnames(size_suball) <- c("origin", "day", "Nbr_of_features", "Nbr_of_cells", "Analysis_step_name", "Step_number")
size_suball$Nbr_of_cells <- as.integer(as.character(size_suball$Nbr_of_cells))
size_suball$Nbr_of_features <- as.integer(as.character(size_suball$Nbr_of_features))
size_suball$day <- factor(size_suball$day, levels = c("4", "5", "5bis", "6", "7", "11"), ordered = TRUE)
knitr::kable(size_suball, align = "lrrrrlr")
| | origin | day | Nbr_of_features | Nbr_of_cells | Analysis_step_name | Step_number |
|---|---|---|---|---|---|---|
| ref_day_04 | ref | 4 | 23961 | 6898 | Preliminary Counts | 01 |
| ref_day_05 | ref | 5 | 23961 | 7783 | Preliminary Counts | 01 |
| ref_day_06 | ref | 6 | 23961 | 8313 | Preliminary Counts | 01 |
| ref_day_07 | ref | 7 | 23961 | 7502 | Preliminary Counts | 01 |
| lab_day_04 | lab | 4 | 32286 | 7324 | Preliminary Counts | 01 |
| lab_day_05 | lab | 5 | 32286 | 7794 | Preliminary Counts | 01 |
| lab_day_05bis | lab | 5bis | 32286 | 7729 | Preliminary Counts | 01 |
| lab_day_06 | lab | 6 | 32286 | 6628 | Preliminary Counts | 01 |
| lab_day_11 | lab | 11 | 32287 | 9584 | Preliminary Counts | 01 |
ggplot(size_suball, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = origin, fill = origin), position = position_dodge(),
width = 0.7) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) + ggtitle("Number of cells in each sub-dataset") +
facet_grid(. ~ origin)
The cells can be filtered out based on multiple criteria, like:
the number of features detected in a cell transcriptome,
the number of counts for each single cell or,
the percentage of mitochondrial counts.
Thus, the feature plots of the nCount_RNA and percent.mt metadata will guide me in choosing the thresholds to apply for filtering out the low-quality cells.
To determine the mitochondrial percentage in each cell, I rely on the fact that mouse mitochondrial gene names always begin with "mt-". The PercentageFeatureSet function calculates, for each cell, the percentage of all counts belonging to the features matching a given pattern; here, the pattern is "^mt-".
allSO.list <- future_lapply(allSO.list, function(SO) {
SO[["percent.mt"]] <- PercentageFeatureSet(SO, pattern = "^mt-")
return(SO)
}, future.seed = 26)
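What PercentageFeatureSet computes can be sketched on a toy count matrix in base R (hypothetical gene names and counts, chosen only for illustration):

```r
# 3 genes x 3 cells; "mt-Nd1" stands for a mitochondrial gene.
counts <- matrix(c(5, 2, 90,
                   10, 0, 5,
                   85, 98, 5),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("mt-Nd1", "Sox17", "Mrpl15"),
                                 paste0("cell", 1:3)))
mt.genes <- grep("^mt-", rownames(counts), value = TRUE)
# mitochondrial counts per cell, as a percentage of total counts per cell
percent.mt <- colSums(counts[mt.genes, , drop = FALSE]) / colSums(counts) * 100
percent.mt
# cell1: 5, cell2: 2, cell3: 90
```

With the thresholds chosen below (1.5% to 10%), cell1 and cell2 would be kept and cell3 filtered out.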
The percentage of mitochondrial counts (or reads) is stored for each cell as metadata under the name percent.mt (see the example below on the lab data, day 4).
| | orig.ident | nCount_RNA | nFeature_RNA | day | dataset | replicate | percent.mt |
|---|---|---|---|---|---|---|---|
| lab_day_04_0_AAACCCAAGTACAGAT | lab_day_04 | 59952 | 6798 | Day_04 | lab_day_04 | 0 | 2.340206 |
| lab_day_04_0_AAACCCAAGTTCATCG | lab_day_04 | 38805 | 5932 | Day_04 | lab_day_04 | 0 | 1.154490 |
| lab_day_04_0_AAACCCACAAGCGAAC | lab_day_04 | 45520 | 6008 | Day_04 | lab_day_04 | 0 | 1.001758 |
| lab_day_04_0_AAACCCACAAGGTTGG | lab_day_04 | 26327 | 4497 | Day_04 | lab_day_04 | 0 | 1.762449 |
| lab_day_04_0_AAACCCACACATATGC | lab_day_04 | 28763 | 4954 | Day_04 | lab_day_04 | 0 | 2.718771 |
| lab_day_04_0_AAACCCACACTAAACC | lab_day_04 | 27481 | 5004 | Day_04 | lab_day_04 | 0 | 3.114879 |
lapply(allSO.list, function(SO) {
VlnPlot(SO, features = c("percent.mt"), ncol = 3) + geom_hline(aes(yintercept = 10), color = "blue", size = 1.5) +
geom_hline(aes(yintercept = 1.5), color = "orange", size = 1.5)
})
The reference sub-datasets show a mitochondrial content that never exceeds 15% of the transcripts of a cell. In addition, their cells never have less than 1.5% mitochondrial content, which is why the orange line does not appear on the first 4 plots (reference sub-datasets). This suggests that the data were already filtered. In contrast, some cells in the lab data have a mitochondrial content close to 80%.
I will keep the cells whose percentage is between 1.5% and 10% (orange and blue lines, respectively).
Information coming from 10X Genomics
Apoptotic cells express mitochondrial genes and export these transcripts to the cytoplasm in mammalian cells. Lysed cells with intact mitochondria may be partitioned into GEMs, increasing the fraction of mitochondrial transcripts detected.
It is also mentioned in Luecken MD and Theis FJ Mol Syst Biol. 2019 that a high mitochondrial content may indicate doublets.
Rossi et al. filtered out cells with fewer than 2000 read counts. Here I apply a slightly stricter minimum of 3000 read counts (green line), and I will also remove the cells with more than 150,000 read counts (purple line).
lapply(allSO.list, function(SO) {
VlnPlot(SO, features = c("nCount_RNA"), ncol = 3) + geom_hline(aes(yintercept = 150000), color = "purple",
size = 1.5) + geom_hline(aes(yintercept = 3000), color = "green", size = 1.5)
})
The reference sub-datasets never have less than 3000 read counts, which is why the green line does not appear on the first 4 plots (reference sub-datasets). Actually, they never even have less than 5000 read counts.
Furthermore, the upper threshold filters out some outliers in both the reference and the laboratory sub-datasets. The 2 datasets of the laboratory day-5 sequencing do not reach such high read counts.
The number of feature per cell is also assessed for quality control.
lapply(allSO.list, function(SO) {
VlnPlot(SO, features = c("nFeature_RNA"), ncol = 3) + geom_hline(aes(yintercept = 1800), color = "#56B4E9",
size = 1.5)
})
The reference sub-datasets never have less than 2000 features per cell, while the lab sub-datasets range from 38 to 10064 features per cell. Many cells of the lab sub-datasets show a low number of detected features. Thus, a minimum of 1800 detected features per cell will be required.
lapply(allSO.list, function(SO) {
ggplot(SO@meta.data, aes(nCount_RNA, nFeature_RNA, colour = percent.mt)) + geom_point() + lims(colour = c(0,
100)) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + ylab("nFeature_RNA") + geom_hline(aes(yintercept = 1800),
color = "#56B4E9", size = 1) + geom_vline(aes(xintercept = 150000), color = "purple", size = 1) +
geom_vline(aes(xintercept = 3000), color = "green", size = 1) + ggtitle(paste0("Nb of features as function of Nb of counts\n",
SO@project.name))
})
Considering all the quality-control criteria together, I confirm the previously established thresholds. The low-quality cells will be removed: the cells with a high content of mitochondrially expressed genes lie under the blue line (minimum of 1800 features) and to the left of the green line (minimum of 3000 read counts). Such cells, with a low count depth (counts per cell), few detected features and a high mitochondrial content, are indicative of cells whose cytoplasmic mRNA has leaked out through a broken membrane, so that only the mRNA located in the mitochondria is conserved (Luecken MD and Theis FJ Mol Syst Biol. 2019).
Among the thousands of detected features, some are expressed in only a few cells. Such features do not provide information on the cell heterogeneity. It is also good to know that filtering out the features expressed in fewer than 50 cells (red dashed line) would make it difficult to detect clusters gathering fewer than 50 cells.
# Create dataframe
countcells.df.list <- lapply(allSO.list, function(SO) {
df <- data.frame(rowSums(SO@assays$RNA@counts != 0))
df$features <- rownames(df)
colnames(df) <- c("Nbr_of_cells", "features")
rownames(df) <- NULL
return(df)
})
lapply(1:length(countcells.df.list), function(i) {
project.name <- names(countcells.df.list)[i]
df <- countcells.df.list[[i]]
# Plot gene expression repartition
ggplot(df, aes(x = Nbr_of_cells)) + geom_histogram() + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) +
ylab("frequency") + scale_x_continuous(trans = "log10") + expand_limits(x = c(0, 10500), y = c(0,
2000)) + geom_vline(aes(xintercept = 10), color = "green", size = 1) + geom_vline(aes(xintercept = 50),
color = "red", size = 1, linetype = "dashed") + ggtitle(paste0("Repartition of expressed features\n",
project.name))
})
In order to limit the number of dropouts while still being able to identify small clusters, the features expressed in fewer than 10 cells will be removed.
In this section, I apply all the cutoffs determined in the previous steps. As a result, the sub-datasets will have both fewer features and fewer cells.
allSO.list <- lapply(1:length(allSO.list), function(i) {
df <- countcells.df.list[[i]]
SO <- allSO.list[[i]]
SO <- subset(SO, features = which(df$Nbr_of_cells > 10), subset = percent.mt > 1.5 & percent.mt < 10 &
nCount_RNA > 3000 & nCount_RNA < 150000 & nFeature_RNA > 1800)
return(SO)
})
names(allSO.list) <- c(ref_dataset_names, lab_dataset_names)
Size of all the sub-datasets after filtering:
size_suball2 <- data.frame(cbind(
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
}),
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "04")) { "4" }
else if (str_detect(SO@project.name, "05bis")) { "5bis" }
else if (str_detect(SO@project.name, "05")) { "5" }
else if (str_detect(SO@project.name, "06")) { "6" }
else if (str_detect(SO@project.name, "07")) { "7" }
else if (str_detect(SO@project.name, "11")) { "11" }
}),
t(sapply(allSO.list, dim)),
"QC_filtering",
"02"
))
colnames(size_suball2) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball2$Nbr_of_cells <- as.integer(as.character(size_suball2$Nbr_of_cells))
size_suball2$Nbr_of_features <- as.integer(as.character(size_suball2$Nbr_of_features))
size_suball2$day <- factor(size_suball2$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball, size_suball2)
knitr::kable(size_suball_track) %>%
scroll_box(width = "100%", height = "200px")
rm(size_suball, size_suball2)
| | origin | day | Nbr_of_features | Nbr_of_cells | Analysis_step_name | Step_number |
|---|---|---|---|---|---|---|
| ref_day_04 | ref | 4 | 23961 | 6898 | Preliminary Counts | 01 |
| ref_day_05 | ref | 5 | 23961 | 7783 | Preliminary Counts | 01 |
| ref_day_06 | ref | 6 | 23961 | 8313 | Preliminary Counts | 01 |
| ref_day_07 | ref | 7 | 23961 | 7502 | Preliminary Counts | 01 |
| lab_day_04 | lab | 4 | 32286 | 7324 | Preliminary Counts | 01 |
| lab_day_05 | lab | 5 | 32286 | 7794 | Preliminary Counts | 01 |
| lab_day_05bis | lab | 5bis | 32286 | 7729 | Preliminary Counts | 01 |
| lab_day_06 | lab | 6 | 32286 | 6628 | Preliminary Counts | 01 |
| lab_day_11 | lab | 11 | 32287 | 9584 | Preliminary Counts | 01 |
| ref_day_041 | ref | 4 | 17599 | 6726 | QC_filtering | 02 |
| ref_day_051 | ref | 5 | 17973 | 7637 | QC_filtering | 02 |
| ref_day_061 | ref | 6 | 17918 | 8133 | QC_filtering | 02 |
| ref_day_071 | ref | 7 | 17951 | 7385 | QC_filtering | 02 |
| lab_day_041 | lab | 4 | 17268 | 5518 | QC_filtering | 02 |
| lab_day_051 | lab | 5 | 18368 | 6728 | QC_filtering | 02 |
| lab_day_05bis1 | lab | 5bis | 18184 | 6689 | QC_filtering | 02 |
| lab_day_061 | lab | 6 | 17547 | 5196 | QC_filtering | 02 |
| lab_day_111 | lab | 11 | 17952 | 6398 | QC_filtering | 02 |
ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name),
position = position_dodge(), width = 0.6) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Number of cells in each sub-dataset\nafter QC") + facet_grid(. ~ origin)
Note that the Rossi et al. data appear to have already been preprocessed: the reference sub-datasets lose far fewer cells at this step than the lab sub-datasets, and the earlier plots showed that their mitochondrial content, read counts and feature counts were already within reasonable bounds.
To compare the gene expression across multiple cells, the data have to be normalized. I use the default parameters of the NormalizeData function, which means the method used is LogNormalize. With this method, the feature counts of each cell are divided by the total counts of that cell and multiplied by the scale.factor (10,000 by default). The result is then natural-log transformed using log1p.
allSO.list <- future_lapply(allSO.list, NormalizeData, verbose = FALSE, future.seed = 26)
The normalization is computed from the raw counts slot named counts. The result is then stored in another slot named data.
knitr::kable(data.frame(GetAssayData(allSO.list$ref_day_04, slot = "counts")[1:5, 1:5]), caption = "Raw counts in 'counts' slot") %>%
scroll_box(width = "800px", height = "200px")
knitr::kable(data.frame(GetAssayData(allSO.list$ref_day_04, slot = "data")[1:5, 1:5]), caption = "Normalized counts in 'data' slot") %>%
scroll_box(width = "800px", height = "200px")
| | ref_day_04_0_AAACCCATCGACTCCT | ref_day_04_0_AAACGAAGTCATAACC | ref_day_04_0_AAACGCTCAAGCGAGT | ref_day_04_0_AAACGCTCACCCTAAA | ref_day_04_0_AAAGAACAGCCTTGAT |
|---|---|---|---|---|---|
| Xkr4 | 0 | 0 | 0 | 0 | 0 |
| Rp1 | 0 | 0 | 0 | 0 | 0 |
| Sox17 | 5 | 0 | 0 | 0 | 0 |
| Mrpl15 | 9 | 5 | 6 | 16 | 9 |
| Lypla1 | 1 | 0 | 1 | 5 | 0 |
| | ref_day_04_0_AAACCCATCGACTCCT | ref_day_04_0_AAACGAAGTCATAACC | ref_day_04_0_AAACGCTCAAGCGAGT | ref_day_04_0_AAACGCTCACCCTAAA | ref_day_04_0_AAAGAACAGCCTTGAT |
|---|---|---|---|---|---|
| Xkr4 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Rp1 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Sox17 | 0.6628109 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Mrpl15 | 0.9904438 | 0.7849221 | 0.8650819 | 1.1922741 | 0.8718615 |
| Lypla1 | 0.1723114 | 0.0000000 | 0.2063636 | 0.5406086 | 0.0000000 |
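The normalized values above can be reproduced approximately by hand (a base-R sketch using the raw counts shown above and the nCount_RNA of the first cell; the small discrepancy with the table comes from the features removed during QC, which slightly lowers the per-cell total used by NormalizeData):

```r
counts <- c(Sox17 = 5, Mrpl15 = 9, Lypla1 = 1)  # raw counts in the first cell
total <- 53183                                  # nCount_RNA of that cell
scale.factor <- 1e4                             # Seurat default
# counts per 10,000, natural-log transformed with log1p
normalized <- log1p(counts / total * scale.factor)
round(normalized, 4)
# Sox17 ~ 0.663, close to the 0.6628 shown in the table
```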
I continue with the standard preprocessing steps, using default parameters:
detect the highly variable genes (HVG) with the FindVariableFeatures function, where the 2000 features with the most cell-to-cell variation are identified,
scale the data, via a linear transformation, using the ScaleData function, on all features (more details below),
perform linear dimension reduction with Principal Component Analysis (PCA) to identify the sources of heterogeneity in every sub-dataset, computing 50 principal components (PCs),
plot an ElbowPlot to visualize the most informative PCs.
allSO.list <- future_lapply(allSO.list, function(SO) {
SO <- FindVariableFeatures(SO, nfeatures = 2000, verbose = FALSE)
SO <- ScaleData(SO, features = rownames(SO), verbose = FALSE)
SO <- RunPCA(SO, verbose = FALSE)
return(SO)
}, future.seed = 26)
The ScaleData function performs a linear transformation (so-called scaling). In detail, it shifts the expression of each gene so that its mean across cells is 0 (negative values arise), and scales it so that its variance equals 1. Concretely, it standardizes the expression of each gene. This step gives equal weight to all genes in downstream analyses, so that highly expressed genes do not dominate (<https://satijalab.org/seurat/articles/pbmc3k_tutorial.html>).
The scaling step created a new slot named scale.data. Below is an example with the same features and cells as above.
knitr::kable(data.frame(GetAssayData(allSO.list$ref_day_04, slot = "scale.data")[1:5, 1:5]), caption = "Scaled data in 'scale.data' slot") %>%
scroll_box(width = "800px", height = "200px")
| | ref_day_04_0_AAACCCATCGACTCCT | ref_day_04_0_AAACGAAGTCATAACC | ref_day_04_0_AAACGCTCAAGCGAGT | ref_day_04_0_AAACGCTCACCCTAAA | ref_day_04_0_AAAGAACAGCCTTGAT |
|---|---|---|---|---|---|
| Xkr4 | -0.2033840 | -0.2033840 | -0.2033840 | -0.2033840 | -0.2033840 |
| Rp1 | -0.0341485 | -0.0341485 | -0.0341485 | -0.0341485 | -0.0341485 |
| Sox17 | 1.7263643 | -0.2826478 | -0.2826478 | -0.2826478 | -0.2826478 |
| Mrpl15 | -0.0977819 | -0.7710953 | -0.5084824 | 0.5634377 | -0.4862717 |
| Lypla1 | -0.3483848 | -1.1274818 | -0.1944192 | 1.3168522 | -1.1274818 |
The PCA was performed with the identified HVG as input. The Elbow plots below show, for each sub-dataset, how informative each PC is, in terms of the percentage of variance explained.
In the analysis realized by Rossi et al., the PCA computed only 30 PCs, and all of them were considered in the downstream analysis. Here, I use the default parameters of the RunPCA function, which computes 50 PCs.
lapply(allSO.list, function(SO) {
ElbowPlot(SO, ndims = 50) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(aes(xintercept = 30),
color = "purple", size = 1) + ggtitle(paste0("ElbowPlot\n", SO@project.name))
})
Taking into account the advice from the Seurat - Guided Clustering Tutorial vignette, the downstream analysis will be done on the first 30 PCs for all the sub-datasets.
Then, I move on to community detection among the cells. The edges are constructed using a k-nearest neighbors (KNN) graph, based on the Euclidean distances previously obtained in the PCA space. The weight of each edge is based on the shared overlap of the two cells' local neighborhoods, a method called Jaccard similarity. These 2 steps are done by the FindNeighbors function (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html).
Once I obtain a graph on the data, I run the FindClusters function. It applies the Louvain algorithm (by default), which optimizes the modularity of the graph.
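As a minimal illustration of the Jaccard weighting (a sketch, not Seurat's actual implementation), the weight of the edge between two cells can be computed from their k-nearest-neighbor sets:

```r
# Hypothetical neighbor sets (indices of the k = 5 nearest neighbors) for two cells
neighbors_cell1 <- c(2, 3, 5, 8, 13)
neighbors_cell2 <- c(3, 5, 8, 21, 34)

# Jaccard similarity: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(neighbors_cell1, neighbors_cell2)
# 3 shared neighbors out of 7 distinct ones -> 3/7, about 0.43
```

Cells whose neighborhoods barely overlap thus get edge weights close to 0, which prunes spurious connections before the Louvain step.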
Originally, the analysis realized by Rossi et al. set the resolution parameter to 4.6. Here, I test the Louvain algorithm with multiple resolution levels (0.1, 0.5, 1, 2 and 5).
res <- c(0.1, 0.5, 1, 2, 5)
allSO.list <- future_lapply(allSO.list, function(SO) {
SO <- FindNeighbors(SO, dims = 1:30, verbose = FALSE)
SO <- FindClusters(SO, resolution = res, verbose = FALSE)
return(SO)
}, future.seed = 26)
The relationship between the resolution levels and the number of clusters is presented below.
data <- c()
resolution <- c()
nb_clusters <- c()
for (SO in allSO.list) {
for (ares in res) {
nb_levels <- nlevels(SO@meta.data[[paste0("RNA_snn_res.", ares)]])
data <- append(data, SO@project.name)
resolution <- append(resolution, ares)
nb_clusters <- append(nb_clusters, nb_levels)
}
}
df <- data.frame(data, resolution, nb_clusters)
knitr::kable(pivot_wider(df, names_from = "data", values_from = "nb_clusters"), caption = "Number of clusters in sub-datasets\naccording to the resolution level")
| resolution | ref_day_04 | ref_day_05 | ref_day_06 | ref_day_07 | lab_day_04 | lab_day_05 | lab_day_05bis | lab_day_06 | lab_day_11 |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 5 | 5 | 6 | 7 | 3 | 4 | 4 | 5 | 10 |
| 0.5 | 10 | 10 | 11 | 14 | 10 | 8 | 8 | 14 | 15 |
| 1.0 | 14 | 14 | 17 | 18 | 13 | 12 | 14 | 18 | 22 |
| 2.0 | 24 | 25 | 24 | 26 | 23 | 22 | 21 | 30 | 32 |
| 5.0 | 42 | 44 | 43 | 47 | 47 | 44 | 46 | 48 | 48 |
By running a non-linear dimension reduction such as Uniform Manifold Approximation and Projection (UMAP), the clusters can be visualized.
allSO.list <- lapply(allSO.list, RunUMAP, dims = 1:30, verbose = FALSE)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
for (ares in res) {
print(DimPlot(allSO.list$lab_day_11, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") +
ggtitle(paste0("Resolution level: ", ares, "\n", allSO.list$lab_day_11@project.name)) + theme(plot.title = element_text(hjust = 0.5)))
}
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
For the day_11 sub-dataset from the laboratory, it appears that the best value for the resolution parameter is 1. I set the Idents of all sub-datasets to the clusters defined at resolution level 1.
allSO.list <- lapply(allSO.list, function(SO) {
Idents(SO) <- SO$RNA_snn_res.1
print(DimPlot(SO, label = TRUE, reduction = "umap") + ggtitle(paste0("Resolution level: 1\n", SO@project.name)) +
theme(plot.title = element_text(hjust = 0.5)))
return(SO)
})
gc()
## pK Identification (no ground-truth) ---------------------------------------------------------------------------------------
pK.list <- sapply(allSO.list,
function(SO){
sweep.res.list_SO <- paramSweep_v3(SO, PCs = 1:30) # as estimated from PC elbowPlot
sweep.stats_SO <- summarizeSweep(sweep.res.list_SO, GT = FALSE)
bcmvn_SO <- find.pK(sweep.stats_SO)
# print() is required for the plot to be rendered from inside the function
print(ggplot(bcmvn_SO, aes(pK, BCmetric, group = 1)) +
        geom_point() +
        geom_line())
pK <- bcmvn_SO %>% # select the pK that corresponds to max bcmvn to optimize doublet detection
filter(BCmetric == max(BCmetric)) %>%
select(pK)
pK <- as.numeric(as.character(pK[[1]]))
return(pK)
})
## Load pijuansala classifier (/!\ not AI) -----------------------------------------------------------------------------------
scmap_classifier <- readRDS(file=paste0(classifier.folder, "scmap_classifier_1000markers.rds"))
ref_stages <- c("E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5")
## Annotate cells from pijuansala classifier ---------------------------------------------------------------------------------
allSO.list <- lapply(allSO.list,
function(SO){
set.seed(1234567)
# create the corresponding SingleCellExperiment object
sce <- as.SingleCellExperiment(x=SO)
rowData(sce)$feature_symbol <- rownames(sce)
counts(sce) <- as.matrix(counts(sce))
logcounts(sce) <- as.matrix(logcounts(sce))
# apply scmapCluster
scmapCluster_results <- scmapCluster(projection=sce, index_list=scmap_classifier[ref_stages], threshold=0)
# add the celltype from the in vivo atlas to the Seurat object
SO <- AddMetaData(object=SO, metadata=scmapCluster_results$combined_labs, col.name="celltype_DF")
# save memory, do garbage collection
rm(sce)
gc()
return(SO)
})
lapply(allSO.list,
function(SO){
Idents(SO) <- SO$celltype_DF
DimPlot(SO, pt.size = 0.8, label = T, label.size = 6, repel = T,
cols = colors.use_transferred[levels(Idents(SO))]) +
NoLegend() +
ggtitle(SO@project.name) +
theme(plot.title = element_text(hjust = 0.5))
}
)
## Homotypic Doublet Proportion Estimate -------------------------------------------------------------------------------------
dblts_nonhomo <- sapply(allSO.list,
function(SO){
# based on annotation (celltype from the pijuansala classifier)
# could also be the Louvain clusters determined in the preprocessing step
annotations <- SO@meta.data$celltype_DF
homotypic.prop <- modelHomotypic(annotations)
nDoublets <- round(ncol(SO)*7.6/100)
nDoublets_nonhomo <- round(nDoublets*(1-homotypic.prop))
return(nDoublets_nonhomo)
})
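The arithmetic behind this estimate can be sketched with hypothetical numbers: the homotypic fraction is modeled (as in DoubletFinder's modelHomotypic) as the sum of squared cell-type proportions, and the expected doublet count at the assumed 7.6% rate is then reduced by that fraction.

```r
# Hypothetical annotation vector for a 1000-cell sub-dataset
annotations <- rep(c("Cardiomyocytes", "Endothelium", "Gut"),
                   times = c(500, 300, 200))

# Homotypic doublet proportion: sum of squared cell-type frequencies
props <- prop.table(table(annotations))
homotypic.prop <- sum(props^2)                 # 0.5^2 + 0.3^2 + 0.2^2 = 0.38

# Expected doublets at a 7.6% rate, keeping only the non-homotypic ones
nDoublets <- round(1000 * 7.6 / 100)           # 76
nDoublets_nonhomo <- round(nDoublets * (1 - homotypic.prop))  # 47
```

Homotypic doublets (two cells of the same type) are excluded from the target count because DoubletFinder cannot distinguish them from singlets anyway.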
# run doubletFinder ----------------------------------------------------------------------------------------------------------
allSO.list <- lapply(allSO.list,
function(SO){
  SO <- doubletFinder_v3(SO,
                         PCs = 1:30,
                         pN = 0.25,
                         pK = pK.list[[SO@project.name]],
                         nExp = dblts_nonhomo[[SO@project.name]])
  return(SO)
})
# remove the cells tagged as doublet -----------------------------------------------------------------------------------------
allSO.list <- lapply(allSO.list,
function(SO){
# remove the cells identified as doublets of each of the sub-datasets
col_dblts <- grep("DF.classifications", colnames(SO@meta.data), value=TRUE)
Idents(SO) <- col_dblts
SO <- subset(SO, idents='Singlet')
# remove the DoubletFinder classification column and standardize the pANN column name
SO@meta.data[[grep("DF.classifications", colnames(SO@meta.data))]] <- NULL
names(SO@meta.data)[names(SO@meta.data) == grep("pANN",colnames(SO@meta.data),value=T)] <- "pANN"
return(SO)
})
## Check size of the subdatasets ---------------------------------------------------------------------------------------------
size_suball3 <- data.frame(cbind(
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
}),
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "04")) { "4" }
else if (str_detect(SO@project.name, "05bis")) { "5bis" }
else if (str_detect(SO@project.name, "05")) { "5" }
else if (str_detect(SO@project.name, "06")) { "6" }
else if (str_detect(SO@project.name, "07")) { "7" }
else if (str_detect(SO@project.name, "11")) { "11" }
}),
t(sapply(allSO.list, dim)),
"Removed Doublets",
"03"
))
colnames(size_suball3) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball3$Nbr_of_cells <- as.integer(as.character(size_suball3$Nbr_of_cells))
size_suball3$Nbr_of_features <- as.integer(as.character(size_suball3$Nbr_of_features))
size_suball3$day <- factor(size_suball3$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball_track, size_suball3)
knitr::kable(size_suball_track)
rm(size_suball3)
ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) +
geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name), position = position_dodge(), width = 0.6) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Number of cells in each sub-dataset\nafter doublets removal") +
facet_grid(.~ origin)
## back to the preprocessing analysis: recluster cells after removing doublets ----------------------------------------------
res <- seq(0.8, 1.6, 0.2)
allSO.list <- lapply(allSO.list,
function(SO){
SO <- NormalizeData(SO, verbose = FALSE)
SO <- FindVariableFeatures(SO, nfeatures = 2000, verbose = FALSE)
SO <- ScaleData(SO, features = rownames(SO), verbose = FALSE)
SO <- RunPCA(SO, verbose = FALSE)
# print() is required for the ElbowPlot to be rendered from inside the function
print(ElbowPlot(SO, ndims = 50) +
        theme_minimal() +
        theme(plot.title = element_text(hjust = 0.5)) +
        geom_vline(aes(xintercept = 30), color = "purple", size = 1) +
        ggtitle(paste0("ElbowPlot\n", SO@project.name)))
SO <- FindNeighbors(SO, dims = 1:30, verbose = FALSE)
SO <- FindClusters(SO, resolution = res, verbose = FALSE)
SO <- RunUMAP(SO, dims = 1:30)
for (ares in res){
print(DimPlot(SO, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") +
ggtitle(paste0("Resolution level: ", ares, "\n", SO@project.name)) +
theme(plot.title = element_text(hjust=0.5)))
}
return(SO)
})